feature distillation
- Asia > China > Hong Kong (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Orange County > Anaheim (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Education (0.67)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China > Liaoning Province > Dalian (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
A Training and implementation details
We train the baseline ensembles for 200 epochs using SGD with momentum 0.9 and weight decay. The initial learning rate is 0.1, and we decay it by a factor of 10. In our implementation, DVERGE starts from the trained baseline ensembles. Note also that GAL requires ReLU to be replaced with leaky ReLU to avoid vanishing gradients. Ensembles with adversarial training follow the baseline's training setup. We use 0.5 as the input transformation probability for M-DI. As for previous methods, although ADP requires the smallest time budget, it does not improve robustness much, as shown in Figure 4. The number after the slash in the first column is the number of sub-models within the ensemble.
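A minimal sketch of the training setup described above, assuming a standard PyTorch workflow; the weight-decay value and the learning-rate decay milestones are not stated in the text, so the ones below are placeholders.

```python
import torch.nn as nn
import torch.optim as optim

def build_optimizer(model: nn.Module):
    # SGD with momentum 0.9 as stated; the weight_decay value is a placeholder.
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                          weight_decay=5e-4)
    # Decay the learning rate by a factor of 10; the milestone epochs are assumed.
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                               milestones=[100, 150],
                                               gamma=0.1)
    return optimizer, scheduler
```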
Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
Tasai, Ren, Li, Guang, Togo, Ren, Ogawa, Takahiro, Hirata, Kenji, Tang, Minghui, Yoshimura, Takaaki, Sugimori, Hiroyuki, Nishioka, Noriko, Shimizu, Yukie, Kudo, Kohsuke, Haseyama, Miki
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues such as the scarcity of large-scale, accurately annotated datasets and the domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
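The abstract describes WKD only at a high level. Below is a minimal sketch of a Wasserstein-distance-based feature distillation loss, assuming teacher and student features are summarized per channel by diagonal Gaussians; the paper's exact WKD and BKE formulations are not given here, so this is an illustrative approximation.

```python
import torch

def wkd_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """2-Wasserstein distance between per-channel Gaussian feature summaries.

    Both inputs are (batch, channels) tensors. For diagonal Gaussians,
    W2^2 = ||mu_s - mu_t||^2 + ||sigma_s - sigma_t||^2.
    """
    teacher_feat = teacher_feat.detach()           # teacher is not updated
    mu_s, mu_t = student_feat.mean(dim=0), teacher_feat.mean(dim=0)
    sd_s, sd_t = student_feat.std(dim=0), teacher_feat.std(dim=0)
    return ((mu_s - mu_t) ** 2 + (sd_s - sd_t) ** 2).sum()
```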
- North America > United States (0.14)
- Africa > Togo (0.05)
- Europe > San Marino > Fiorentino > Fiorentino (0.04)
- (7 more...)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.93)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- (3 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.62)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Li, Yichen, Wang, Xiuying, Xu, Wenchao, Wang, Haozhao, Qi, Yining, Dong, Jiahua, Li, Ruixuan
Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data local. To better aggregate knowledge from clients, ensemble distillation, a widely used and effective technique, is often applied after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while model-agnostic through softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that incorporates aligned feature information via orthogonal projection to better integrate knowledge from heterogeneous models. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server maintains a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer, mitigating the knowledge bias from heterogeneous models and thus maximizing the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
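As a rough illustration of the per-client projection idea, the sketch below keeps one orthogonally re-parameterized projection layer per client architecture on the server and matches the projected client features to the global model's features. It assumes PyTorch's built-in orthogonal parametrization; the module names, dimensions, and the MSE matching loss are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrizations

class ClientProjection(nn.Module):
    """Server-side projection aligning one client architecture's features."""
    def __init__(self, client_dim: int, server_dim: int):
        super().__init__()
        proj = nn.Linear(client_dim, server_dim, bias=False)
        # Orthogonal re-parameterization keeps the projection (semi-)orthogonal,
        # limiting how much it can distort the client's feature geometry.
        self.proj = parametrizations.orthogonal(proj)

    def forward(self, client_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(client_feat)

def feature_distillation_loss(server_feat, client_feats, projections):
    # Match the global model's features to each client's aligned features;
    # client features are assumed to come from frozen client-side models.
    losses = [nn.functional.mse_loss(server_feat, p(f))
              for f, p in zip(client_feats, projections)]
    return torch.stack(losses).mean()
```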
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (4 more...)
BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
Sy, Yaya, Cerisara, Christophe, Illina, Irina
Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
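For illustration, here is a minimal sketch of the embedding-compression step via truncated SVD, assuming a PyTorch embedding layer; the rank and the two-stage (small embedding plus up-projection) layout are assumptions rather than BaldWhisper's actual configuration, and the feature-distillation fine-tuning that follows is omitted.

```python
import torch
import torch.nn as nn

def factorize_embedding(embed: nn.Embedding, rank: int) -> nn.Sequential:
    """Replace a (vocab, dim) embedding with a (vocab, rank) x (rank, dim) pair."""
    W = embed.weight.data                            # (vocab, dim)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                       # (vocab, rank)
    B = Vh[:rank]                                    # (rank, dim)
    low_rank = nn.Embedding(W.shape[0], rank)
    low_rank.weight.data.copy_(A)
    up_proj = nn.Linear(rank, W.shape[1], bias=False)
    up_proj.weight.data.copy_(B.T)                   # Linear stores (out, in)
    # token ids -> (.., rank) -> (.., dim), approximating the original lookup
    return nn.Sequential(low_rank, up_proj)
```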
- Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.40)
- Africa > Mali (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- (2 more...)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China > Liaoning Province > Dalian (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- North America > Canada > Ontario > Toronto (0.14)
- Asia > China > Hong Kong (0.04)
- North America > United States > Virginia (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)